About this project

This project is about learning data science using R.

GitHub

More information can be found in my GitHub repository.


RStudio Exercise 2: Regression and model validation

Data Wrangling

See the file learning2014.r in my GitHub repository.

Data analysis for RStudio Exercise 2

Install required packages

First, we install the required packages for the analysis: ggplot2, GGally, gridExtra (for multiple plots)

install.packages("ggplot2", repos = "http://cloud.r-project.org")
install.packages("GGally", repos = "http://cloud.r-project.org")
install.packages("gridExtra", repos = "http://cloud.r-project.org")

Read data from the web:

lrn14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep=",", header=TRUE)

Assess the data structure:

summary(lrn14)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

The dataset is used to study the relationship between learning approaches and students’ achievements in an introductory statistics course in Finland. It consists of 166 observations of 7 variables. The sample consists of 110 females and 56 males. The respondents’ age ranges between 17 and 55. The median age is 22.

str(lrn14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
  1. age describes the age of the respondent.
  2. gender describes the sex of the respondent.
  3. points describes the exam points of the respondent.
  4. deep is a composite variable that measures the respondents’ inclination to deep learning methods.
  5. stra is a composite variable that measures the respondents’ inclination to strategic learning methods.
  6. surf is a composite variable that measures the respondents’ inclination to surface learning methods.
  7. attitude is a composite variable that measures the respondents’ attitude towards learning statistics.
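Each composite is formed in the data wrangling step (see learning2014.r) as the mean of the questionnaire items belonging to it. A toy sketch of this construction, with made-up item names and values:

```r
# Toy sketch: a composite variable is the row mean of its questionnaire items
# (the item names q1-q3 and values are hypothetical; see learning2014.r for the real ones)
items <- data.frame(q1 = c(4, 2, 5), q2 = c(3, 3, 4), q3 = c(5, 1, 4))
deep_score <- rowMeans(items)
deep_score  # one composite value per respondent
```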

Summary statistics of the learning data

First, we load the required libraries (GGally, ggplot2, gridExtra)

library(ggplot2)
library(GGally)
library(gridExtra)

Plot summary of learning data with ggpairs()

Overview of the learning data:

ggpairs(lrn14, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20))) + ggtitle("Learning 2014: Summary Statistics")

Next, we plot each learning approach against exam points to assess the correlation between the variables.

# initialize plots with data and aesthetic mapping

# attitude vs points
p1 <- ggplot(lrn14, aes(x = attitude, y=points, col=gender)) + geom_point() + ggtitle("Attitude vs points") + theme(legend.position="bottom")

# stra vs points
p2 <- ggplot(lrn14, aes(x = stra, y=points, col=gender)) + geom_point() + ggtitle("Strategic learning vs points")

# surf vs points
p3 <- ggplot(lrn14, aes(x = surf, y=points, col=gender)) + geom_point() + ggtitle("Surface learning vs points")

# deep vs points
p4 <- ggplot(lrn14, aes(x = deep, y=points, col=gender)) + geom_point() + ggtitle("Deep learning vs points")

# create common legend
g_legend<-function(a.gplot){
  tmp <- ggplot_gtable(ggplot_build(a.gplot))
  leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
  legend <- tmp$grobs[[leg]]
  return(legend)}

mylegend<-g_legend(p1)

# multiple plot
grid.arrange(arrangeGrob(p1 + theme(legend.position="none"),
                               p2 + theme(legend.position="none"),
                               p3 + theme(legend.position="none"),
                               p4 + theme(legend.position="none"),
                               nrow=2),
                   mylegend, nrow=2,heights=c(10,1))

Attitude and study achievement seem to be correlated, whereas the learning approach variables (deep, stra, surf) do not show a clear correlation with study achievement.
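This visual impression can be checked numerically, a quick sketch:

```r
# correlations of each learning variable with exam points
cor(lrn14[, c("attitude", "deep", "stra", "surf")], lrn14$points)
```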

Next, we plot histograms for each variable according to gender

p6 <- ggplot(lrn14, aes(x=attitude, fill=gender, color=gender)) +
  geom_histogram(position="identity", alpha=0.5) 

p7 <- ggplot(lrn14, aes(x=deep, fill=gender, color=gender)) +
  geom_histogram(position="identity", alpha=0.5)

p8 <- ggplot(lrn14, aes(x=stra, fill=gender, color=gender)) +
  geom_histogram(position="identity", alpha=0.5)

p9 <- ggplot(lrn14, aes(x=surf, fill=gender, color=gender)) +
  geom_histogram(position="identity", alpha=0.5)

p10 <- ggplot(lrn14, aes(x=points, fill=gender, color=gender)) +
  geom_histogram(position="identity", alpha=0.5)


grid.arrange(arrangeGrob(p6 + theme(legend.position="none"),
                         p7 + theme(legend.position="none"),
                         p8 + theme(legend.position="none"),
                         p9 + theme(legend.position="none"),
                         p10 + theme(legend.position="none"),
                         nrow=3),
             mylegend, nrow=2,heights=c(10,1), top="Learning 2014: Frequencies")

Males appear to have a generally more positive attitude towards statistics than females, but there do not appear to be other clear differences between the groups.

Regression analysis

Next, we move to regression analysis to determine which variables affect study achievement (points).

  • Model 1:
  • Dependent variable: points
  • Explanatory variables: attitude, stra, surf
my_model1 <- lm(points ~ attitude + stra + surf, data = lrn14)
summary(my_model1)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The variables stra and surf are not statistically significant. We drop surf as an explanatory variable first, because its p-value is the largest.
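The refit can be written out by hand, or equivalently produced with update(); a sketch:

```r
# drop surf from model 1; equivalent to lm(points ~ attitude + stra, data = lrn14)
update(my_model1, . ~ . - surf)
```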

  • Model 2:
  • Dependent variable: points
  • Explanatory variables: attitude, stra
my_model2 <- lm(points ~ attitude + stra, data = lrn14)
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

The variable stra is not statistically significant at the 5% level. We keep attitude as the only explanatory variable (with the intercept), because it is statistically significant at the 5% level.

  • Model 3:
  • Dependent variable: points
  • Explanatory variables: attitude
my_model3 <- lm(points ~ attitude, data = lrn14)
summary(my_model3)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Both the explanatory variable attitude and the intercept are statistically significant. Model 3 may be interpreted as follows: a one-point increase in attitude is associated with an increase of about 3.5 exam points.
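A quick sketch confirms this interpretation with predict():

```r
# predicted exam points at attitude = 3 and attitude = 4
preds <- predict(my_model3, newdata = data.frame(attitude = c(3, 4)))
diff(preds)  # equals the attitude coefficient, about 3.5
```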

Model validation

par(mfrow = c(1,3))
plot(my_model3, which=c(1,2,5))

The residuals appear to be randomly scattered around zero with roughly constant spread, which supports the assumptions that the errors have zero mean and constant variance.

The Q-Q-plot shows that the relationship between the sample percentiles and the theoretical samples from the normal distribution is linear, which suggests that the model satisfies the condition that the errors are normally distributed.

The residuals-versus-leverage plot shows that there are no high-leverage outliers that would unduly influence the regression results.


RStudio Exercise 3: Logistic regression

Data Wrangling

See the file create_alc.R in my GitHub repository.

Data analysis for RStudio Exercise 3

Install required packages

First, we install the required packages for the analysis: tidyr, dplyr, ggplot2, GGally, gridExtra (for multiple plots)

install.packages("tidyr", repos = "http://cloud.r-project.org")
install.packages("dplyr", repos = "http://cloud.r-project.org")
install.packages("ggplot2", repos = "http://cloud.r-project.org")
install.packages("GGally", repos = "http://cloud.r-project.org")
install.packages("gridExtra", repos = "http://cloud.r-project.org")

and make the packages available:

library(tidyr); library(dplyr); library(ggplot2); library(gridExtra)

Read data from data folder:

alc <- read.table("C:/Users/palme/Documents/IODS-project/data/alc.csv", sep=",", header=TRUE)

Glimpse at the alc data

glimpse(alc) 
## Observations: 382
## Variables: 35
## $ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,...
## $ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, ...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, G...
## $ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, ...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob       <fct> at_home, at_home, at_home, health, other, services,...
## $ Fjob       <fct> teacher, other, other, services, other, other, othe...
## $ reason     <fct> course, course, other, home, home, reputation, home...
## $ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ye...
## $ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes,...
## $ guardian   <fct> mother, father, mother, mother, father, mother, mot...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, ...
## $ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes,...
## $ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, ...
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes,...
## $ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...

Dataset information:

This dataset contains measures of student achievement in two Portuguese schools. The data include student grades as well as demographic, social, and school-related features. The dataset provides average performance in two distinct subjects: mathematics and Portuguese language.

The dataset consists of 382 observations of 35 variables.

Variables:

  1. school - student’s school
  2. sex - student’s sex
  3. age - student’s age
  4. address - urban or rural
  5. famsize - family size
  6. Pstatus - parent’s cohabitation status
  7. Medu - mother’s education
  8. Fedu - father’s education
  9. Mjob - mother’s job
  10. Fjob - father’s job
  11. reason - reason to choose this school
  12. guardian - student’s guardian
  13. traveltime - home to school travel time
  14. studytime - weekly study time
  15. failures - number of past class failures
  16. schoolsup - extra educational support
  17. famsup - family educational support
  18. paid - extra paid classes
  19. activities - extra-curricular activities
  20. nursery - attended nursery school
  21. higher - wants to take higher education
  22. internet - Internet access at home
  23. romantic - with a romantic relationship
  24. famrel - quality of family relationships
  25. freetime - free time after school
  26. goout - going out with friends
  27. Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
  28. Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
  29. health - health status
  30. absences - number of absences
  31. G1 - first period average grade (math and Portuguese)
  32. G2 - second period average grade (math and Portuguese)
  33. G3 - final period average grade (math and Portuguese)
  34. alc_use - average of Dalc and Walc
  35. high_use - TRUE if alc_use is higher than 2, FALSE otherwise
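The definitions of the last two variables can be verified directly against the data; a sketch:

```r
# check that alc_use and high_use match their stated definitions
all(alc$alc_use == (alc$Dalc + alc$Walc) / 2)  # TRUE
all(alc$high_use == (alc$alc_use > 2))         # TRUE
```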

Analysis of the causes of high alcohol consumption

First, we explore the data by drawing a bar plot of each variable:

gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

Main hypothesis

There is quite a lot of variation in the respondents’ free time and study time, so we use these variables, along with age and sex, as explanatory variables for alcohol consumption. The main hypothesis is that respondents with a lot of free time also have time to consume alcohol. Moreover, respondents who report many hours of study may choose not to consume alcohol, as drinking might decrease their ability to study. Age might also be positively correlated with alcohol consumption, with older respondents drinking more than their younger counterparts. Finally, there might be gender differences in alcohol consumption.

We study the relationship between high_use and studytime, age, freetime, and sex. First, we produce summary statistics by group:

alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean(studytime), mean(age), mean(freetime))
## # A tibble: 4 x 6
## # Groups:   sex [2]
##   sex   high_use count `mean(studytime)` `mean(age)` `mean(freetime)`
##   <fct> <lgl>    <int>             <dbl>       <dbl>            <dbl>
## 1 F     FALSE      156              2.34        16.6             2.93
## 2 F     TRUE        42              2           16.5             3.36
## 3 M     FALSE      112              1.88        16.3             3.39
## 4 M     TRUE        72              1.64        17.0             3.5

Graphical illustration of data:

g1 <- ggplot(alc, aes(x = high_use, y = age, col = sex)) #plot high_use and age, separate by sex
g1 + geom_boxplot() + ggtitle("Age by alcohol consumption and sex") # define the plot as a boxplot and draw it

g2 <- ggplot(alc, aes(x = high_use, y = studytime, col = sex)) # plot high_use and studytime by sex

g2 + geom_boxplot() + ggtitle("Study time by alcohol consumption and sex") # define the plot as a boxplot and draw it

g3 <- ggplot(alc, aes(x = high_use, y = freetime, col = sex)) #plot of high_use and freetime

g3 + geom_boxplot() + ggtitle("Free time by alcohol consumption and sex") # define the plot as a boxplot and draw it

Main findings:

  • The mean age of males who report high alcohol use is higher than that of males who report low-level drinking.
  • The mean age of females who report high alcohol use is slightly lower than that of females who consume less, although the mean ages of the two groups are almost the same.
  • Males and females who report high alcohol use have, on average, more free time than those who consume less.
  • Males and females who report high alcohol use study, on average, less than those who consume less.
  • The ratio of high users to low users is higher among males (72/112 ≈ 0.64) than among females (42/156 ≈ 0.27).
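The shares of high use by sex can also be computed directly from the data; a sketch:

```r
# share of respondents reporting high alcohol use, by sex
alc %>% group_by(sex) %>% summarise(share_high = mean(high_use))
# F: 42/198 ≈ 0.21,  M: 72/184 ≈ 0.39
```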

Logistic regression

Run logistic regression with high_use as the dependent variable and studytime, age, freetime, sex as explanatory variables

m <- glm(high_use ~ age + studytime + freetime + sex, data = alc, family = "binomial")

The summary of the model

summary(m)
## 
## Call:
## glm(formula = high_use ~ age + studytime + freetime + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6526  -0.8509  -0.6602   1.1301   2.2095  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -4.8768     1.7790  -2.741  0.00612 **
## age           0.2290     0.1011   2.265  0.02349 * 
## studytime    -0.4697     0.1602  -2.931  0.00337 **
## freetime      0.2503     0.1229   2.037  0.04166 * 
## sexM          0.5854     0.2482   2.359  0.01833 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 431.41  on 377  degrees of freedom
## AIC: 441.41
## 
## Number of Fisher Scoring iterations: 4

All coefficients are statistically significant at the 5% level, suggesting that the chosen variables are associated with high alcohol consumption.

Compute the Odds Ratios (OR)

OR <- coef(m) %>% exp

and confidence intervals

CI <- confint(m) %>% exp

The Odds Ratios and confidence intervals are provided in the table below:

cbind(OR, CI)
##                      OR        2.5 %    97.5 %
## (Intercept) 0.007621647 0.0002191106 0.2381440
## age         1.257384089 1.0338141044 1.5381420
## studytime   0.625158887 0.4522877516 0.8490039
## freetime    1.284393173 1.0119391550 1.6399589
## sexM        1.795640894 1.1056484552 2.9303141

The odds ratios can be interpreted as the multiplicative change in the odds of high alcohol use given a one-unit increase in the explanatory variable. For example, the regression coefficient for age is 0.2290, so a one-year increase in age multiplies the odds of high alcohol use by exp(0.2290) ≈ 1.26.
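As a check, exponentiating the age coefficient reproduces its odds ratio; a sketch:

```r
# the odds ratio for age is exp of its log-odds coefficient
exp(coef(m)["age"])  # ≈ 1.257, matching the table above
```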

To summarize, the amount of free time, age, and being male increase the odds of high alcohol consumption, whereas the amount of study time decreases the odds of high alcohol use, as hypothesized above.

Exploring the predictive power of the model

First, we compute the predicted probabilities and turn them into class predictions

probabilities <- predict(m, type = "response") # predict() the probability of high_use
alc <- mutate(alc, probability = probabilities) # add the predicted probabilities to 'alc'
alc <- mutate(alc, prediction = probability > 0.5) # use the probabilities to make a prediction of high_use

and check the predictions for the last 10 observations

select(alc, studytime, freetime, age, sex, high_use, probability, prediction) %>% tail(10) # see the last ten original classes, predicted probabilities, and class predictions
##     studytime freetime age sex high_use probability prediction
## 373         1        3  19   M    FALSE   0.5845170       TRUE
## 374         1        4  18   M     TRUE   0.5896690       TRUE
## 375         3        3  18   F    FALSE   0.1958322      FALSE
## 376         1        4  18   F    FALSE   0.4445380      FALSE
## 377         3        4  19   F    FALSE   0.2822698      FALSE
## 378         2        3  18   F    FALSE   0.2803350      FALSE
## 379         2        2  18   F    FALSE   0.2327073      FALSE
## 380         2        1  18   F    FALSE   0.1910235      FALSE
## 381         1        4  17   M     TRUE   0.5333414       TRUE
## 382         1        4  18   M     TRUE   0.5896690       TRUE

Predictions vs actual values

table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   255   13
##    TRUE     95   19
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction)) # plot 'high_use' versus 'probability' in 'alc'
g + geom_point() 

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.66753927 0.03403141 0.70157068
##    TRUE  0.24869110 0.04973822 0.29842932
##    Sum   0.91623037 0.08376963 1.00000000
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

There is a high number of false negatives (95), i.e. cases where the model predicts low alcohol use for students who actually report high use, whereas the number of false positives (13) is quite low. This suggests that the model's ability to detect high use is poor; it could be improved by lowering the classification threshold from 0.5.
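A lower threshold trades false negatives for false positives; a sketch with an arbitrary illustrative threshold of 0.35:

```r
# reclassify with a lower threshold (0.35 is an arbitrary choice for illustration)
table(high_use = alc$high_use, prediction = alc$probability > 0.35)
```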

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2827225

On average, the model makes a wrong prediction about 28% of the time (0.2827), which is better than the error rate of random guessing:

# compute the average number of wrong predictions in the (training) data using random predictions
loss_func(class = alc$high_use, prob = runif(length(alc$high_use))>0.5)
## [1] 0.4842932

As the number of predictions grows, the average share of wrong predictions from random guessing should approach 0.5.
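A quick simulation illustrates this; a sketch:

```r
# error rate of random guessing approaches 0.5 as the number of guesses grows
set.seed(1)
truth   <- sample(c(TRUE, FALSE), 1e5, replace = TRUE)
guesses <- runif(1e5) > 0.5
mean(guesses != truth)  # close to 0.5
```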

Perform K-fold cross-validation

library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

The average share of wrong predictions in the 10-fold cross-validation is 0.2879581, which is higher than for the model in the DataCamp exercise. A better model could probably be found using other explanatory variables.

One such model explains high_use by studytime, sex, and goout.

m2 <- glm(high_use ~ studytime + sex + goout, data = alc, family = "binomial")

summary(m2)
## 
## Call:
## glm(formula = high_use ~ studytime + sex + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7187  -0.8058  -0.5224   0.8849   2.6300  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.7318     0.5683  -4.807 1.53e-06 ***
## studytime    -0.4822     0.1683  -2.865  0.00417 ** 
## sexM          0.6722     0.2581   2.605  0.00919 ** 
## goout         0.7519     0.1186   6.340 2.30e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 393.95  on 378  degrees of freedom
## AIC: 401.95
## 
## Number of Fisher Scoring iterations: 4
cv2 <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)

# average number of wrong predictions in the cross validation
cv2$delta[1]
## [1] 0.2408377

RStudio Exercise 4: Clustering and classification

Housekeeping

First, install required packages:

install.packages('plotly', repos = "http://cloud.r-project.org")
install.packages('dplyr', repos = "http://cloud.r-project.org")
install.packages('ggplot2', repos = "http://cloud.r-project.org")
install.packages('GGally', repos = "http://cloud.r-project.org")
install.packages("corrplot", repos = "http://cloud.r-project.org")

Access the required libraries:

library(dplyr)
library(MASS)
library(ggplot2)
library(GGally)

Summary of the data

For this exercise, we use the Boston dataset, which contains information on housing in Boston, Massachusetts. For more information, see this website

We start out by loading the data from the MASS package.

# load the data
data("Boston")

Explore the dataset Boston:

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The dataset contains 506 observations of 14 variables.

  1. CRIM - per capita crime rate by town
  2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS - proportion of non-retail business acres per town.
  4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  5. NOX - nitric oxides concentration (parts per 10 million)
  6. RM - average number of rooms per dwelling
  7. AGE - proportion of owner-occupied units built prior to 1940
  8. DIS - weighted distances to five Boston employment centres
  9. RAD - index of accessibility to radial highways
  10. TAX - full-value property-tax rate per $10,000
  11. PTRATIO - pupil-teacher ratio by town
  12. BLACK - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT - % lower status of the population
  14. MEDV - Median value of owner-occupied homes in $1000’s
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

We notice that the variable chas is a dummy variable (either 0 or 1), whereas the other variables are continuous with varying means and variances.

We can see this more clearly by visualizing the data.

Visualize the data

# use ggplot to explore the data
#ggplot(Boston, aes(x=zn)) + geom_histogram(binwidth=50)

ggpairs(Boston, lower = list(combo = wrap("facethist", bins = 20)))

We can also plot the correlation between variables:

library(corrplot)
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

The crime rate (crim) appears to correlate most strongly with accessibility to radial highways (rad) and the property-tax rate (tax).
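The impression from the plot can be checked numerically; a sketch:

```r
# correlations of crim with the other variables, sorted from strongest positive
sort(cor_matrix["crim", ], decreasing = TRUE)
```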

Standardize the data for the linear discriminant analysis (LDA)

boston_scaled <- scale(Boston) # center and standardize variables

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Looking at the scaled data, we can see that each variable is now centered around zero (the mean is zero by construction) and has a standard deviation of one, whereas the means and scales varied in the original data.
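What scale() does can be verified on a toy vector (an illustrative check, not part of the original analysis): it subtracts the column mean and divides by the column standard deviation.

```r
# manual standardization of a toy vector
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), manual)  # TRUE
```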

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

## Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). 
# Use the quantiles as the break points in the categorical variable.

# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

LDA on the train set

First, we fit the linear discriminant analysis on the train set by using the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables.

# fit the LDA model
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2574257 0.2425743 0.2475248 0.2524752 
## 
## Group means:
##                  zn      indus         chas        nox         rm
## low       0.9520298 -0.8856000 -0.083045403 -0.8619406  0.4706172
## med_low  -0.1376603 -0.2327100 -0.031282114 -0.5228258 -0.1062960
## med_high -0.3740445  0.1959843  0.239493963  0.4002060  0.1363554
## high     -0.4872402  1.0149946 -0.002135914  1.0369373 -0.4104222
##                 age        dis        rad        tax     ptratio
## low      -0.8738534  0.8306651 -0.6892330 -0.7741769 -0.43914542
## med_low  -0.2662015  0.3097855 -0.5517820 -0.4519226 -0.05016053
## med_high  0.4276229 -0.3892234 -0.4030438 -0.3026464 -0.38085651
## high      0.8047064 -0.8666019  1.6596029  1.5294129  0.80577843
##               black       lstat         medv
## low       0.3807844 -0.78117908  0.566532993
## med_low   0.3209599 -0.07684817 -0.005896954
## med_high  0.0865215  0.04436830  0.203345929
## high     -0.7731102  0.92384457 -0.677692268
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## zn       0.127259674  0.78372288 -0.89800568
## indus   -0.001768243 -0.27192166  0.21510195
## chas    -0.059502703 -0.07712873  0.01585696
## nox      0.265670346 -0.61718597 -1.38625012
## rm      -0.110095554 -0.11836129 -0.10512960
## age      0.287763231 -0.27777305 -0.11334077
## dis     -0.116482227 -0.26569519  0.21409019
## rad      3.512512373  0.86204829 -0.26362288
## tax     -0.031302743 -0.09278550  0.79407314
## ptratio  0.149263570  0.21118561 -0.19150248
## black   -0.164541439 -0.03097913  0.10202188
## lstat    0.215160692 -0.27288347  0.42506707
## medv     0.194213785 -0.28475652 -0.11964034
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9551 0.0334 0.0114

We can see that the first linear discriminant (LD1) explains most of the differences between the groups, as its proportion of trace is very large (0.9551).
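The proportion of trace can also be recovered by hand from the singular values stored in the fitted lda object (a sketch using the lda.fit object fitted above):

```r
# proportion of between-group variance explained by each discriminant
prop_trace <- lda.fit$svd^2 / sum(lda.fit$svd^2)
round(prop_trace, 4)
```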

We can see this more clearly by plotting the results:

# create the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

The first linear discriminant (LD1) separates the high crime areas, but cannot distinguish between the lower crime neighborhoods. The second linear discriminant (LD2) explains some of the differences among the low crime neighborhoods, with lower values of LD2 associated with higher crime levels.

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

summary(test)
##        zn               indus               chas        
##  Min.   :-0.48724   Min.   :-1.40179   Min.   :-0.2723  
##  1st Qu.:-0.48724   1st Qu.:-0.95539   1st Qu.:-0.2723  
##  Median :-0.48724   Median :-0.43683   Median :-0.2723  
##  Mean   : 0.01552   Mean   :-0.08059   Mean   :-0.1179  
##  3rd Qu.: 0.34886   3rd Qu.: 1.01499   3rd Qu.:-0.2723  
##  Max.   : 3.15732   Max.   : 2.11552   Max.   : 3.6648  
##       nox                 rm               age                dis         
##  Min.   :-1.40402   Min.   :-2.7279   Min.   :-2.15905   Min.   :-1.1919  
##  1st Qu.:-0.94665   1st Qu.:-0.5848   1st Qu.:-0.93254   1st Qu.:-0.6948  
##  Median :-0.26921   Median :-0.2471   Median : 0.04885   Median :-0.1424  
##  Mean   :-0.04813   Mean   :-0.1010   Mean   :-0.07719   Mean   : 0.1036  
##  3rd Qu.: 0.59809   3rd Qu.: 0.3218   3rd Qu.: 0.89258   3rd Qu.: 0.6865  
##  Max.   : 2.72965   Max.   : 3.4733   Max.   : 1.11639   Max.   : 3.2841  
##       rad                tax               ptratio        
##  Min.   :-0.98187   Min.   :-1.306758   Min.   :-2.70470  
##  1st Qu.:-0.63733   1st Qu.:-0.716383   1st Qu.:-0.44137  
##  Median :-0.52248   Median :-0.452346   Median : 0.22840  
##  Mean   :-0.03157   Mean   :-0.009144   Mean   : 0.06356  
##  3rd Qu.:-0.17794   3rd Qu.: 0.313064   3rd Qu.: 0.80578  
##  Max.   : 1.65960   Max.   : 1.529413   Max.   : 1.63721  
##      black               lstat               medv         
##  Min.   :-3.878357   Min.   :-1.36017   Min.   :-1.84110  
##  1st Qu.: 0.178909   1st Qu.:-0.71006   1st Qu.:-0.43577  
##  Median : 0.389737   Median :-0.18107   Median :-0.16666  
##  Mean   :-0.008339   Mean   :-0.09701   Mean   :-0.09364  
##  3rd Qu.: 0.429827   3rd Qu.: 0.34056   3rd Qu.: 0.15409  
##  Max.   : 0.440616   Max.   : 3.40663   Max.   : 2.98650
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       17       6        0    0
##   med_low    5      20        3    0
##   med_high   1       9       15    1
##   high       0       0        1   24

The results in the above table show that the model predicts high crime areas well (24 of 25 correct), but the low, medium-low and medium-high categories are harder to separate. In particular, a sizable share of medium-high areas are misclassified as medium-low, and some low areas as medium-low.
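The overall and per-class accuracy can be computed directly from the cross-tabulation (a sketch using the correct_classes and lda.pred objects from the steps above):

```r
conf <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(conf)) / sum(conf)    # overall accuracy
diag(conf) / rowSums(conf)     # accuracy by true class
```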

K-means

#Reload the data from the MASS package
library(MASS)
data('Boston')

#Scale the data
boston_scaled <- scale(Boston)

Next, we calculate the distances between the observations:

# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method = 'manhattan')

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

We can see that the Euclidean distances are shorter than the Manhattan distances: the Euclidean distance is the length of the straight line between two observations, whereas the Manhattan distance is measured along the coordinate axes at right angles.
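A toy example makes the difference concrete: for the points (0, 0) and (3, 4), the Euclidean distance is sqrt(3^2 + 4^2) = 5, while the Manhattan distance is |3| + |4| = 7.

```r
pts <- rbind(c(0, 0), c(3, 4))
dist(pts)                        # Euclidean: 5
dist(pts, method = "manhattan")  # Manhattan: 7
```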

Next, we run k-means clustering and determine the optimal number of clusters.

set.seed(123) # Fix random number generator

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

Because the total within-cluster sum of squares decreases fastest between one and two clusters, we perform k-means clustering with two centers.

# k-means clustering
km <-kmeans(boston_scaled, centers = 2)

Then we plot the scaled dataset, coloring the observations by cluster:

# plot the scaled Boston dataset, colored by cluster
pairs(boston_scaled, col = km$cluster)

# k-means clustering
km <-kmeans(boston_scaled, centers = 4)
summary(km)
##              Length Class  Mode   
## cluster      506    -none- numeric
## centers       56    -none- numeric
## totss          1    -none- numeric
## withinss       4    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           4    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
# plot the Boston dataset with clusters
pairs(boston_scaled[, 1:6], col = km$cluster)

# plot the Boston dataset with clusters
pairs(boston_scaled[, 7:14], col = km$cluster)

Bonus

Next, we perform linear discriminant analysis on the Boston data, using the clusters from k-means as the target variable.

# Read data
data(Boston)
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
# Scale data
boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)


# k-means with four centers
km1 <- kmeans(boston_scaled, centers = 4)

lda.fit1 <- lda(km1$cluster ~ ., data = boston_scaled)


# target classes as numeric
classes <- as.numeric(km1$cluster)

# plot the lda results
plot(lda.fit1, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit1, myscale = 1)

We can see that the linear discriminants are useful at explaining the different clusters along the two dimensions, because the clusters are quite separate from one another.

Super Bonus

Use the plotly package to create a 3D plot:

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')

RStudio Exercise 5

Data wrangling

See file create_human.R in my GitHub repository

Analysis

First, load the data

human = read.table("data/human.csv", header=TRUE, sep=" ")

Load the GGally, corrplot and dplyr libraries to be used in the analysis:

library(GGally)
library(corrplot)
library(dplyr)

Next we summarize the data

summary(human)
##    Edu2.Ratio       LFP.Ratio         Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

We can see that the lowest life expectancy (Life.Exp) is 49 years and the highest is 83.5. The data also show that the share of female representatives in parliament (Parli.F) varies between countries, ranging from 0 to 57.5%. On average, females obtain less secondary education than males (Edu2.Ratio) and participate less in the labor force (LFP.Ratio).

Data visualization

Next, we visualize the data

# visualize the 'human' variables
ggpairs(human)

The female-to-male secondary education ratio (Edu2.Ratio) and labor force participation ratio (LFP.Ratio) concentrate most of their mass toward high values, which suggests that female participation in secondary education and the labor force is high in most of the sample countries. Moreover, most countries in the sample are relatively poor, given that the GNI distribution is right-skewed: most of the mass sits at low incomes, with a long tail of high-income countries.

Looking in more detail at the correlations in the figure below, we can see some expected relationships.

# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot

Maternal mortality (Mat.Mor) and adolescent fertility rate (Ado.Birth) are negatively correlated with relative education of female to male (Edu2.Ratio) and gross national income (GNI). GNI is positively correlated with female education Edu2.Ratio and life expectancy (Life.Exp) and expected education (Edu.Exp).

Principal components analysis (non-standardized data)

Next we perform principal components analysis (PCA) on the data.

# perform principal component analysis
pca_human <- prcomp(human)
summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000

The above table shows that the first principal component (PC1) explains practically all (99.99%) of the variance in the data, suggesting that the non-standardized data can be summarized in a single dimension.
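Why one component dominates can be illustrated with a small simulated example (not part of the original data): when one variable has a much larger variance than the others, unscaled PCA loads almost entirely on it.

```r
set.seed(1)
toy <- data.frame(x = rnorm(100), y = rnorm(100, sd = 100))
# unscaled: PC1 captures nearly all of the variance
summary(prcomp(toy))$importance["Proportion of Variance", "PC1"]
# scaled: the variance is shared roughly equally
summary(prcomp(toy, scale. = TRUE))$importance["Proportion of Variance", "PC1"]
```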

biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))

The biplot shows that the first principal component explains much of the cross-country variation, which is largely driven by gross national income (GNI).

Principal component analysis (standardized data)

# standardize the variables
human_std <- scale(human)
# perform principal component analysis (with the SVD method)
pca_human_std <- prcomp(human_std)
summary(pca_human_std)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000

Using the standardized data, we can see that the first five principal components now explain 92% of the variance, whereas the first principal component alone explained almost all of the variance in the non-standardized data.

This is because PCA is sensitive to the scale of the variables: the very large values of GNI dominated the non-standardized results, a problem that standardization avoids.

This can be seen clearly in the figure below:

# draw a biplot of the principal component representation and the original variables
biplot(pca_human_std, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))

We can see that gross national income (GNI), female participation in secondary education (Edu2.Ratio), expected years of education (Edu.Exp) and life expectancy (Life.Exp) have large negative weights on the first principal component (PC1), whereas maternal mortality (Mat.Mor) and adolescent birth rate (Ado.Birth) have positive weights on PC1.

Moreover, female labor force participation relative to men (LFP.Ratio) and the share of female representatives in parliament (Parli.F) have positive weights on PC2.
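These weights can be read directly from the rotation (loadings) matrix of the prcomp object (a sketch using the pca_human_std object fitted above):

```r
# loadings of the original variables on the first two principal components
round(pca_human_std$rotation[, 1:2], 3)
```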

Interpretation of principal components

The above suggests that PC1 seems to explain economic differences between countries, given that variables associated with economic development, such as gross national income and health have the largest weights on PC1.

PC2 may be interpreted as a measure of gender inequality, because it is largely driven by female participation in the parliament and relative labor force participation.

Tea dataset from FactoMineR

Install the FactoMineR package and load the tea dataset. The data come from a questionnaire on tea consumption, product perception and personal details.

First, explore the data:

install.packages('FactoMineR', repos = "http://cloud.r-project.org")
library(FactoMineR)
data(tea)
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255  
##  Not.feminine:171   sophisticated    :215   slimming   : 45  
##                                                              
##                                                              
##                                                              
##                                                              
##                                                              
##         exciting          relaxing              effect.on.health
##  exciting   :116   No.relaxing:113   effect on health   : 66    
##  No.exciting:184   relaxing   :187   No.effect on health:234    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
## 
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

The data includes 300 observations of 36 variables.

# visualize the dataset
library(tidyr)
gather(tea) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Next, we select the variables of interest: Tea, How, how, and where

# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "where")

# select the 'keep_columns' to create a new dataset
tea_vars <- select(tea, one_of(keep_columns))
head(tea_vars)
##         Tea   How     how       where
## 1     black alone tea bag chain store
## 2     black  milk tea bag chain store
## 3 Earl Grey alone tea bag chain store
## 4 Earl Grey alone tea bag chain store
## 5 Earl Grey alone tea bag chain store
## 6 Earl Grey alone tea bag chain store
# multiple correspondence analysis
mca <- MCA(tea_vars, graph = FALSE)
# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_vars, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.413   0.386   0.293   0.262   0.248   0.225
## % of var.             18.368  17.142  13.020  11.650  11.026   9.994
## Cumulative % of var.  18.368  35.510  48.531  60.181  71.207  81.201
##                        Dim.7   Dim.8   Dim.9
## Variance               0.197   0.131   0.095
## % of var.              8.735   5.842   4.222
## Cumulative % of var.  89.936  95.778 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.313  0.079  0.080 | -0.333  0.096  0.090 |  0.653
## 2                  | -0.427  0.147  0.090 | -0.143  0.018  0.010 |  0.980
## 3                  | -0.531  0.228  0.466 | -0.365  0.115  0.221 | -0.188
## 4                  | -0.531  0.228  0.466 | -0.365  0.115  0.221 | -0.188
## 5                  | -0.531  0.228  0.466 | -0.365  0.115  0.221 | -0.188
## 6                  | -0.531  0.228  0.466 | -0.365  0.115  0.221 | -0.188
## 7                  | -0.531  0.228  0.466 | -0.365  0.115  0.221 | -0.188
## 8                  | -0.427  0.147  0.090 | -0.143  0.018  0.010 |  0.980
## 9                  |  0.058  0.003  0.001 |  1.078  1.005  0.497 | -0.234
## 10                 |  0.391  0.123  0.071 |  0.921  0.734  0.394 |  0.280
##                       ctr   cos2  
## 1                   0.485  0.347 |
## 2                   1.094  0.472 |
## 3                   0.040  0.059 |
## 4                   0.040  0.059 |
## 5                   0.040  0.059 |
## 6                   0.040  0.059 |
## 7                   0.040  0.059 |
## 8                   1.094  0.472 |
## 9                   0.062  0.023 |
## 10                  0.089  0.036 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.348   1.808   0.040   3.444 |   0.177   0.501
## Earl Grey          |  -0.214   1.775   0.082  -4.960 |   0.095   0.380
## green              |   0.468   1.460   0.027   2.848 |  -0.955   6.508
## alone              |  -0.038   0.055   0.003  -0.885 |  -0.262   2.896
## lemon              |   0.831   4.595   0.085   5.052 |   0.557   2.214
## milk               |  -0.331   1.396   0.029  -2.955 |   0.209   0.595
## other              |   0.087   0.014   0.000   0.264 |   2.175   9.196
## tea bag            |  -0.591  11.987   0.457 -11.693 |  -0.363   4.851
## tea bag+unpackaged |   0.287   1.556   0.037   3.347 |   1.010  20.715
## unpackaged         |   2.044  30.334   0.570  13.053 |  -0.921   6.596
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.010   1.752 |   1.360  38.943   0.606  13.458 |
## Earl Grey            0.016   2.217 |  -0.461  11.679   0.384 -10.711 |
## green                0.113  -5.808 |  -0.353   1.167   0.015  -2.143 |
## alone                0.128  -6.178 |  -0.171   1.620   0.054  -4.027 |
## lemon                0.038   3.387 |  -0.893   7.483   0.099  -5.428 |
## milk                 0.012   1.863 |   0.538   5.183   0.077   4.795 |
## other                0.146   6.613 |   3.212  26.413   0.319   9.768 |
## tea bag              0.173  -7.186 |   0.131   0.830   0.022   2.591 |
## tea bag+unpackaged   0.465  11.797 |  -0.378   3.830   0.065  -4.421 |
## unpackaged           0.116  -5.880 |   0.369   1.398   0.019   2.359 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.083 0.114 0.607 |
## How                | 0.100 0.230 0.477 |
## how                | 0.725 0.496 0.071 |
## where              | 0.744 0.703 0.017 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")

The above figure shows the associations between the different categories. Unpackaged tea tends to be consumed in a tea shop, whereas bagged tea is purchased from a chain store. Earl Grey tea is more likely to be consumed with milk, black tea with lemon, and green tea alone.

Dimension 1 and Dimension 2 explain 18.4% and 17.1% of the variance, respectively. The model summary shows that the first four dimensions together explain roughly 60 percent of the variance.


RStudio Exercise 6

Data Wrangling

See file meet_and_repeat.R in my GitHub repository

Analysis of longitudinal data: Part I

In this part we will learn to analyze longitudinal data using graphical displays and the summary measure approach. We'll be using the RATS data from a nutrition study in which three groups of rats were treated with different diets. Each rat's body weight was recorded approximately weekly. We are interested in learning whether differences in their diets affected their growth during the observation period.

First, we read the RATS data in long form.
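The read step itself did not survive into this page; a minimal sketch, assuming the wrangled data was saved as `data/RATSL.txt` (the path is an assumption, mirroring the BPRS file used later):

```r
# read the long-form RATS data produced in the data wrangling step
# (the path data/RATSL.txt is an assumption; adjust to your own setup)
RATSL <- read.table("data/RATSL.txt", header = TRUE)
```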

We also need to check the structure of the variables:

str(RATSL)
## 'data.frame':    176 obs. of  4 variables:
##  $ ID    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...

ID and Group were read in as integers, so we convert these categorical variables to factors in the RATSL data:

RATSL$ID <- as.factor(RATSL$ID)
RATSL$Group <- as.factor(RATSL$Group)

We also change Time to be numeric

RATSL$Time <- as.numeric(RATSL$Time)

Graphic displays of RATS data

We begin with a graphical display of RATS data by plotting the points for each subject to show the weight development of individual rats.

Access library ggplot2 to create plots

#Access the package ggplot2
library(ggplot2)

Plotting the RATS data
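The plotting call did not survive into this page; a sketch of the kind of ggplot2 call used here, assuming `RATSL` with `ID` and `Group` as factors:

```r
library(ggplot2)

# weight development of each individual rat, coloured by diet group
ggplot(RATSL, aes(x = Time, y = Weight, group = ID, colour = Group)) +
  geom_line() +
  scale_y_continuous(name = "Weight (grams)")
```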

We can see that the average initial weight for each group of rats is different.

To see the changes in weight more clearly, we can also plot the data separately for each group:
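A sketch of the faceted version of the plot, one panel per diet group:

```r
library(ggplot2)

# one panel per diet group makes within-group changes easier to compare
ggplot(RATSL, aes(x = Time, y = Weight, group = ID)) +
  geom_line(aes(linetype = ID)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none")
```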

We can make a few observations from the data. First, the rats' weights increase over time in most cases. Second, there are quite large differences between the groups, and the group weights do not converge. The rate of change varies little within each group, but it does seem to vary between groups: the weight gain over the sample is larger in Groups 2 and 3 than in Group 1.

We can further study the differences and similarities between groups by looking at the standardized data
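The standardization step behind the `stdWeight` column below can be sketched as follows (within each measurement time, subtract the mean and divide by the standard deviation):

```r
library(dplyr)

# standardise Weight within each measurement time:
# stdWeight = (Weight - mean at that time) / sd at that time
RATSL <- RATSL %>%
  group_by(Time) %>%
  mutate(stdWeight = (Weight - mean(Weight)) / sd(Weight)) %>%
  ungroup()
```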

## Observations: 176
## Variables: 5
## $ ID        <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ Group     <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1...
## $ Time      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8...
## $ Weight    <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 44...
## $ stdWeight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8...

The above figure plots the standardized data for each group. It shows features similar to the non-standardized data, although the differences in the rates of change between groups are now less pronounced.
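The summary dataset shown below (group-wise mean and standard error at each time point) can be built with a sketch like this:

```r
library(dplyr)

# mean weight and its standard error for each group at each time point
RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight), se = sd(Weight) / sqrt(n())) %>%
  ungroup()
```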

## Observations: 33
## Variables: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ Time  <dbl> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, ...
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 26...
## $ se    <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.5529...
##  Group       Time            mean             se        
##  1:11   Min.   : 1.00   Min.   :250.6   Min.   : 3.074  
##  2:11   1st Qu.:15.00   1st Qu.:269.5   1st Qu.: 4.101  
##  3:11   Median :36.00   Median :486.5   Median : 7.358  
##         Mean   :33.55   Mean   :424.7   Mean   :10.703  
##         3rd Qu.:50.00   3rd Qu.:518.2   3rd Qu.:21.047  
##         Max.   :64.00   Max.   :550.2   Max.   :22.390

Now the dataset contains the mean and standard error of each group at each time point, which we can plot:
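A sketch of the mean-profile plot, assuming the summary dataset is named `RATSS`:

```r
library(ggplot2)

# mean profiles with +/- 1 standard error bars for each group
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = 0.3) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
```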

The above figure suggests that there are quite large differences between the means of each group as well as the standard errors.

We can also draw a box-plot for each time period:
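A sketch of the per-period boxplot:

```r
library(ggplot2)

# weight distribution at each measurement time, split by group
ggplot(RATSL, aes(x = factor(Time), y = Weight, fill = Group)) +
  geom_boxplot() +
  scale_x_discrete(name = "Time")
```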

The plots suggest that there are some outliers in Group 2.

Summary graphs

The summary measure method transforms the measurements for each subject into a single value that captures an essential feature of that subject's response. Here the summary measure is each rat's average weight from the second measurement day to the end of the sample. Given that the data are measured at unequal intervals and that we are interested in differences in the growth rates of the groups, a regression coefficient would probably be a better summary measure, but we proceed as instructed.
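The summary measure and its boxplot can be sketched as follows (the dataset name `RATSL8S` is an assumption):

```r
library(dplyr)
library(ggplot2)

# summary measure: each rat's mean weight after the first measurement day
RATSL8S <- RATSL %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise(mean = mean(Weight)) %>%
  ungroup()

# boxplot of the summary measure by group; the diamond marks the group mean
ggplot(RATSL8S, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 4, fill = "white")
```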

The above figure shows that the summary measure for Group 2 is more variable and right-skewed than for the other groups. We can also see an outlier in Group 2, which may bias the results.

Next we draw a new boxplot that eliminates the outlier in Group 2.
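A sketch of the filtering step; the 550-gram cutoff is an assumption read off the boxplot:

```r
library(dplyr)

# drop the outlying rat in Group 2 (cutoff of 550 g is an assumption)
RATSL8S1 <- RATSL8S %>% filter(mean < 550)
```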

The figure suggests that there are clear differences between the group means, but we should confirm this with a statistical test.

Means testing

A t-test can only compare two groups, so we use ANOVA (analysis of variance) to test for differences between the three diet groups. One-way ANOVA can be considered an extension of the t-test to three or more groups.
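The ANOVA table below controls for each rat's starting weight; a sketch of the fitting step, assuming the original wide-form data `RATS` is still in memory and its day-1 weights are in a column named `WD1` (both names are assumptions):

```r
library(dplyr)

# add the day-1 weight as a baseline covariate
# (RATS$WD1 is an assumption about the wide-form column name)
RATSL8S2 <- RATSL8S %>% mutate(baseline = RATS$WD1)

# ANOVA table for group differences, controlling for baseline weight
fit <- lm(mean ~ baseline + Group, data = RATSL8S2)
anova(fit)
```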

## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value   Pr(>F)    
## baseline   1 253625  253625 1859.8201 1.57e-14 ***
## Group      2    879     439    3.2219  0.07586 .  
## Residuals 12   1636     136                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that the p-value for Group is about 0.076, so the group means differ at the 10% significance level, but not at the conventional 5% level.

Analysis of longitudinal data: Part II

In this section, we use linear mixed effects models for a normally distributed response to study the effect of two different treatments on schizophrenia symptoms. Linear mixed effects models for repeated measures data take into account that a subject's responses depend on observed and unobserved individual characteristics; the unobserved characteristics enter the model as random effects.

In the BPRS data, subjects were randomly assigned to one of two treatment groups. Each subject was rated on the brief psychiatric rating scale (BPRS) before treatment began and then weekly during each of the following eight weeks. The BPRS is used to evaluate the presence of schizophrenia.

First we load the longitudinal data

BPRSL <- read.table("data/BPRSL.txt", header=TRUE)

and check the structure:

str(BPRSL)
## 'data.frame':    360 obs. of  4 variables:
##  $ subject  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Week     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...

We need to change categorical values to factors in BPRSL data

BPRSL$treatment <- as.factor(BPRSL$treatment)
BPRSL$subject <- as.factor(BPRSL$subject)
BPRSL$Week <- as.numeric(BPRSL$Week) 
BPRSL$bprs <- as.numeric(BPRSL$bprs)

We can compare the wide and long forms of the data. The original wide-form data has a separate variable for each observation period:

##   treatment subject week0 week1 week2 week3 week4 week5 week6 week7 week8
## 1         1       1    42    36    36    43    41    40    38    47    51
## 2         1       2    58    68    61    55    43    34    28    28    28

The long-form data, by contrast, has one variable for time and one for the measurement, as can be seen in the table below.

##    subject treatment Week bprs
## 1        1         1    0   42
## 2        1         1    1   36
## 3        1         1    2   36
## 4        1         1    3   43
## 5        1         1    4   41
## 6        1         1    5   40
## 7        1         1    6   38
## 8        1         1    7   47
## 9        1         1    8   51
## 10       2         1    0   58
## 11       2         1    1   68
## 12       2         1    2   61
## 13       2         1    3   55
## 14       2         1    4   43
## 15       2         1    5   34
## 16       2         1    6   28
## 17       2         1    7   28
## 18       2         1    8   28

In the plot below, we ignore the repeated-measure nature of the data by abstracting from individuals and only plotting the data by treatment.
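A sketch of the plot, assuming `BPRSL` with `treatment` as a factor:

```r
library(ggplot2)

# bprs over time by treatment group, ignoring which subject each point belongs to
ggplot(BPRSL, aes(x = Week, y = bprs)) +
  geom_point() +
  facet_grid(. ~ treatment, labeller = label_both)
```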

We can see that there are no clear differences between the two groups. The bprs measures seem to be decreasing over time in both groups.

If we look at each subject separately, we can see that the observations for each individual are correlated over time, suggesting the presence of random effects.

The scatterplot confirms this observation, given that observations for consecutive weeks appear to be strongly correlated.

Fitting Linear Mixed Models to BRPS data

Independence model

We first estimate the independence model, in which the BPRS measure is the dependent variable, time and a treatment dummy are the explanatory variables, and no random effects are assumed.
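The fit shown in the output comes from an ordinary `lm()` call (the formula can be read from the `Call:` line):

```r
# ordinary least squares, ignoring the repeated-measures structure
BPRS_reg <- lm(bprs ~ Week + treatment, data = BPRSL)
summary(BPRS_reg)
```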

## 
## Call:
## lm(formula = bprs ~ Week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## Week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

We can see that the treatment dummy is not significant, which would suggest that the treatment does not affect the BPRS measure.

Random intercept model

Next we fit a linear model with a random intercept
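A sketch of the fitting step with `lme4` (the formula matches the output below; `REML = FALSE` is inferred from the "fit by maximum likelihood" line, and the model name `BPRSL_RIM` from the later likelihood ratio test):

```r
library(lme4)

# random intercept for each subject; REML = FALSE gives a maximum likelihood fit
BPRSL_RIM <- lmer(bprs ~ Week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRSL_RIM)
```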

## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ Week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## Week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) Week  
## Week       -0.437       
## treatment2 -0.282  0.000

As can be seen from the regression results, the coefficient estimates are similar to those of the independence model, and the intercept and the time coefficient are again both significant. The standard error for time is smaller in the random intercept model than in the independence model, because assuming independence leads to larger standard errors for within-subject covariates. Somewhat surprisingly, the standard error for the treatment dummy is also smaller in the random intercept model, even though assuming independence usually reduces the standard errors of between-subject coefficients.

Random intercept and slope model

Next, we fit a linear model with both a random slope and random intercept.
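A sketch of the fitting step (the formula matches the output below; the model name `BPRSL_RIM1` is taken from the later likelihood ratio test):

```r
library(lme4)

# random intercept and a random slope for Week, correlated within subject
BPRSL_RIM1 <- lmer(bprs ~ Week + treatment + (Week | subject), data = BPRSL, REML = FALSE)
summary(BPRSL_RIM1)
```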

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ Week + treatment + (Week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7977 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           Week         0.9609  0.9803   -0.51
##  Residual             97.4304  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## Week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) Week  
## Week       -0.582       
## treatment2 -0.247  0.000

The coefficients are again quite similar to those of the independence model and the random intercept model, although the standard error for the treatment dummy is slightly smaller than in the random intercept model.

We can perform a likelihood ratio test to compare the models:
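The test is a single `anova()` call on the two nested models fitted above by maximum likelihood:

```r
# likelihood ratio test: random intercept vs. random intercept + slope
anova(BPRSL_RIM, BPRSL_RIM1)
```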

## Data: BPRSL
## Models:
## BPRSL_RIM: bprs ~ Week + treatment + (1 | subject)
## BPRSL_RIM1: bprs ~ Week + treatment + (Week | subject)
##            Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## BPRSL_RIM   5 2748.7 2768.1 -1369.4   2738.7                           
## BPRSL_RIM1  7 2745.4 2772.6 -1365.7   2731.4 7.2721      2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The likelihood ratio test of the random intercept model against the random intercept and slope model gives a chi-squared statistic of 7.27 with 2 degrees of freedom (p ≈ 0.026), which suggests that the random intercept and slope model provides a better fit for the data.

Random intercept and slope model with treatment-time interaction

Finally, we fit a random intercept and slope model that allows for interaction between treatment and time.
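A sketch of the fitting and comparison steps (formulas and model names match the output below; `Week * treatment` expands to `Week + treatment + Week:treatment`):

```r
library(lme4)

# random intercept and slope model with a treatment-time interaction
BPRSL_RIM2 <- lmer(bprs ~ Week * treatment + (Week | subject), data = BPRSL, REML = FALSE)
summary(BPRSL_RIM2)

# likelihood ratio test against the model without the interaction
anova(BPRSL_RIM1, BPRSL_RIM2)
```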

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ Week * treatment + (Week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           Week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## Week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## Week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) Week   trtmn2
## Week        -0.650              
## treatment2  -0.424  0.469       
## Wek:trtmnt2  0.356 -0.559 -0.840
## Data: BPRSL
## Models:
## BPRSL_RIM1: bprs ~ Week + treatment + (Week | subject)
## BPRSL_RIM2: bprs ~ Week * treatment + (Week | subject)
##            Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## BPRSL_RIM1  7 2745.4 2772.6 -1365.7   2731.4                           
## BPRSL_RIM2  8 2744.3 2775.4 -1364.1   2728.3 3.1712      1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that the results are similar to those of the other linear models. The interaction term between treatment and time is not significant at the 5% level, which means that the rate of change in the bprs measure does not clearly differ between the treatment groups. The likelihood ratio test comparing the random intercept and slope model with and without the interaction term gives p ≈ 0.075, so the interaction improves the fit only marginally, although the interaction model does have a slightly lower AIC.

Fitted vs actual data

We can also plot the fitted values from the interaction model against the actual values in the BPRS data.
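A sketch of the fitted-values plot, assuming the interaction model `BPRSL_RIM2` from above:

```r
library(ggplot2)

# append the fitted values from the interaction model and plot them by subject
BPRSL$Fitted <- fitted(BPRSL_RIM2)
ggplot(BPRSL, aes(x = Week, y = Fitted, group = subject)) +
  geom_line(aes(linetype = treatment)) +
  facet_grid(. ~ treatment, labeller = label_both)
```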

The above figure comparing the fitted values of the interaction model with the observed values shows that the model fits the actual data quite well. However, the treatment does not appear to have predictive power for the rate of change in the bprs measure.

That’s it. Thanks for reading!